Optimization of n-gram Parameters for Natural Language Processing
Abstract
This paper identifies a drawback of conventional approaches to n-gram estimation in Chinese natural language processing: the n-gram parameters are optimized independently of the model's discriminative capability. To address this problem, we propose a discriminative estimation criterion under which the n-gram parameters can be optimized. We implement the approach in a Chinese pinyin-to-character conversion system and run experiments on the tagged text corpus from Peking University. The experimental results show that the conversion rate can be raised by as much as 41.4%.
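The abstract does not spell out the discriminative criterion, so the following is only a minimal sketch of the general idea, not the authors' method. It swaps in a structured-perceptron-style update as a stand-in: bigram scores are adjusted only when the decoder picks the wrong characters, rather than being set from relative frequencies alone. The toy lexicon, the training pairs, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each pinyin syllable maps to candidate characters.
CANDIDATES = {
    "ta": ["他", "她"],
    "shi": ["是", "事"],
}

def bigrams(chars):
    """Return (previous, current) character pairs, starting from <s>."""
    return list(zip(["<s>"] + chars, chars))

def decode(pinyin, scores):
    """Viterbi search for the highest-scoring character sequence."""
    # best[c] = (score, path) of the best partial path ending in character c
    best = {"<s>": (0.0, [])}
    for syllable in pinyin:
        nxt = {}
        for c in CANDIDATES[syllable]:
            nxt[c] = max(
                ((s + scores[(p, c)], path + [c]) for p, (s, path) in best.items()),
                key=lambda t: t[0],
            )
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

def train(data, epochs=5):
    """Perceptron-style training (an assumption, not the paper's criterion):
    reward bigrams of the reference, penalize bigrams of a wrong prediction."""
    scores = defaultdict(float)
    for _ in range(epochs):
        for pinyin, gold in data:
            pred = decode(pinyin, scores)
            if pred != gold:
                for bg in bigrams(gold):
                    scores[bg] += 1.0
                for bg in bigrams(pred):
                    scores[bg] -= 1.0
    return scores

# Hypothetical training pairs: (pinyin syllables, reference characters).
DATA = [
    (["ta", "shi"], ["她", "是"]),
    (["shi"], ["事"]),
]

scores = train(DATA)
```

Calling `decode(["ta", "shi"], scores)` after training returns the reference characters for this toy data. The point of the sketch is that parameters change only in response to conversion errors, which ties the estimation directly to the model's ability to discriminate between competing candidates.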
Similar resources
Recognition of Farsi Texts Using an n-gram Language Model and Grammatical Refinement
Text recognition has been one of the growing research topics in recent years. Much of this research has focused on recognizing letters and sub-words as a basis for identifying larger text structures such as words, phrases and sentences. This thesis presents a new method in which the recognized sub-words are combined in order to provide meaningful words and sentences in Farsi tex...
Language Modeling for limited-data domains
With the increasing focus of speech recognition and natural language processing applications on domains with limited amount of in-domain training data, enhanced system performance often relies on approaches involving model adaptation and combination. In such domains, language models are often constructed by interpolating component models trained from partially matched corpora. Instead of simple...
Unsupervised Separation of Transliterable and Native Words for Malayalam
Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score...
Modeling and Optimization of Hybrid HIR Drying Variables for Processing of Parboiled Paddy Using Response Surface Methodology
The effects of hot air temperature (40, 50 and 60 °C) and Radiation Intensity (RI) (0.21, 0.31 and 0.41 W/cm²) on the response variables (drying time, Head Parboiled Rice Yield (HPRY), color value and hardness) of parboiled rice were investigated. The drying was performed using hybrid hot air–infrared drying. The optimization of drying variables and the relationship b...
Graph-Based N-gram Language Identification on Short Texts
Language identification (LI) is an important task in natural language processing. Several machine learning approaches have been proposed for addressing this problem, but most of them assume relatively long and well written texts. We propose a graph-based N-gram approach for LI called LIGA which targets relatively short and ill-written texts. The results of our experimental study show that LIGA ...
Publication date: 2000